PeruNPDB: the Peruvian Natural Products Database for in silico drug screening

Natural products (NPs) are a large source of novel pharmacological entities and have therefore acquired significance in drug discovery. Peru is considered a megadiverse country with many endemic species of plants, terrestrial and marine animals, and microorganisms. NPs databases have a major impact on drug discovery and development. For this reason, several countries, such as Mexico, Brazil, India, and China, have initiatives to assemble and maintain NPs databases that are representative of their diversity and ethnopharmacological usage. We describe the assembly, curation, and chemoinformatic evaluation of the content and coverage in chemical space, as well as the physicochemical attributes and chemical diversity, of the initial version of the Peruvian Natural Products Database (PeruNPDB), which contains 280 natural products. PeruNPDB is freely accessible (https://perunpdb.com.pe/). The PeruNPDB collection is intended to be used in a variety of tasks, such as virtual screening campaigns against various disease targets or biological endpoints. This emphasizes the significance of biodiversity protection, both direct and indirect, for human health.


Scientific Reports | (2023) 13:7577 | https://doi.org/10.1038/s41598-023-34729-0 | www.nature.com/scientificreports/

[...] balsamum), which was widely used for the treatment of wounds 16 , can be mentioned. However, the potential of Peruvian NPs remains underexploited since most of these useful native species can be domesticated or semi-domesticated 17 . Also, the amount and nature of experimental evidence published on active NPs are still limited 18 , and most of the current studies reported medicinal activities of crude extracts, while potentially active NPs have been isolated from only a small number of plants 19 . Computer-aided drug design (CADD), one of the key approaches to modern pre-clinical drug discovery, can be defined as the set of computational methods that are applied to discover, develop, and analyze drugs and active molecules 20 . Among the approaches that comprise CADD, virtual screening is one of the major contributors, since it stands as a contemporary approach to experimental in vitro high-throughput screening (HTS) for hit identification and optimization 21 . By integrating CADD approaches with curated databases, which are described as well-organized collections of data in any field, the drug development process may be sped up and its cost reduced 22 . Considering this, large databases containing NPs from various data sources have been released, such as the COlleCtion of Open Natural prodUcTs (COCONUT), which contains 406,076 unique "flat" NPs and a total of 730,441 NPs where stereochemistry has been preserved 23 ; and the LOTUS initiative, which has 750,000 referenced structure-organism pairs 24 .
Also, several NPs compound databases from particular geographical locations have been assembled, such as the Traditional Chinese Medicine (TCM) Database@Taiwan, containing approximately 58,000 molecules 25 ; the Indian Medicinal Plants, Phytochemistry and Therapeutics 2.0 (IMPPAT 2.0), which contains more than 10,000 phytochemicals 26 ; and AfroDB, which is composed of around 1000 NPs 27 . Likewise, some countries in Latin America have published their own public NPs databases, such as NuBBEDB, which contains more than 2000 NPs 28 , and SistematX, which contains more than 2500 NPs 29 , both from Brazil, and BIOFACQUIM from Mexico, which contains a total of 531 molecules 30 . Furthermore, NPs databases have been used as repositories to identify several promising candidates for further development for the treatment of diseases 31 , such as Chagas disease 32,33 , tuberculosis 34 , leishmaniasis 35,36 , schistosomiasis 37 , and COVID-19 38 . The present work introduces the first version of the Peruvian Natural Products Database (PeruNPDB), describing its assembly, curation, and chemoinformatic characterization of molecular diversity and coverage in chemical space. The database is freely available at the web interface PeruNPDB Explorer (https://perunpdb.com.pe/). We anticipate that the PeruNPDB will make it possible to conduct additional virtual screening studies to identify innovative pharmacological entities and support other biotechnological approaches, and that it will serve as a resource for information on conservation guidelines.

Methods
Search strategy, study selection, and data extraction. A systematic review search strategy to examine the literature for studies describing NPs from Peruvian sources was adapted from 39 . PubMed, the main database for the health sciences, maintained by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), contains about 32 million citations belonging to more than 5300 journals currently indexed in MEDLINE 40 ; it provides uniform indexing of biomedical literature through the Medical Subject Headings (MeSH terms), which form a controlled vocabulary, or specific set of terms, that describes the topic of a paper consistently and uniformly 41 . Firstly, to find terms associated in the literature with Peruvian NPs, the MeSH terms "Peru" AND "Natural Products" were employed in a search carried out in the PubMed database (https://pubmed.ncbi.nlm.nih.gov/; last searched on 10 June 2022), and the results were plotted as a network map of the co-occurrence of MeSH terms in the VOSviewer software (version 1.6.17) 42 , which employs a modularity-based algorithm to measure the strength of clusters 43 . The resultant cluster content was analyzed to select relevant studies associated with Peruvian NPs. The studies were selected in three phases. First, papers written in languages other than English, duplicate articles, reviews, and meta-analyses were disregarded. The highly relevant full studies were then retrieved and separated from the papers whose title or abstract did not provide enough information to be included. Next, the titles and abstracts of the publications chosen through the search approach were visually evaluated. The data extracted from each study contained the NP's characterization as well as details on the genus and species of the sources from which the NPs were isolated.
Additionally, the information from the bibliographic reference of each study was extracted; all studies that discussed chemicals derived from Peruvian natural sources were considered.
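The keyword co-occurrence counting that underlies such a network map can be sketched as follows. This is a minimal illustration under stated assumptions, not the VOSviewer implementation: the function name `cooccurrence`, its `min_occurrences` parameter, and the toy records are hypothetical, and the sketch only counts term pairs (it does not normalize link strengths or cluster the network).

```python
from collections import Counter
from itertools import combinations


def cooccurrence(records, min_occurrences=5):
    """Count keyword pairs that appear together in the same record.

    records: iterable of keyword lists, one list per paper (hypothetical input).
    Only keywords found in at least `min_occurrences` records are kept.
    """
    # Deduplicate and sort keywords so each paper counts once per term
    # and every pair is always stored in the same order.
    cleaned = [sorted(set(record)) for record in records]
    term_counts = Counter(term for record in cleaned for term in record)
    kept = {term for term, n in term_counts.items() if n >= min_occurrences}

    pair_counts = Counter()
    for record in cleaned:
        terms = [term for term in record if term in kept]
        pair_counts.update(combinations(terms, 2))
    return term_counts, pair_counts


# Toy records invented for illustration only; they are not data from the study.
toy_records = [
    ["Peru", "Plant Extracts", "Humans"],
    ["Peru", "Plant Extracts"],
    ["Peru", "Flavonoids"],
]
term_counts, pair_counts = cooccurrence(toy_records, min_occurrences=2)
print(pair_counts)
```

With a threshold of two occurrences, only "Peru" and "Plant Extracts" survive the filter in the toy data, so a single pair remains in the map.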
PeruNPDB assembly and molecular properties calculation. The simplified molecular-input line-entry system (SMILES) 44 notations of the compounds described in the studies selected in the previous step were searched for and retrieved from the PubChem 45 , DrugBank 46 , or ChEMBL 47 servers, while for unavailable NPs the ChemDraw tool 48 was employed to generate the SMILES notation. Moreover, the Osiris DataWarrior v05.02.01 software 49 was employed to generate the dataset's structure data files (SDFs). These were then uploaded to the Konstanz Information Miner (KNIME) Analytics Platform 50 , where the "Molecular Type Cast" and the "RDKit Structure Normalizer" KNIME nodes were employed to curate the chemical structures in the dataset. Moreover, every compound in the dataset was classified with either NP Classifier 51 , which employs a biosynthetic ontology that is specific to natural products, or ClassyFire 52 , which is a general classification system for small molecules based on the ChemOnt ontology. The KNIME "RDKit Descriptor Calculator" node was employed to calculate six physicochemical properties of therapeutic interest, namely: molecular weight (MW), octanol/water partition coefficient [...] for visualization, and one-way ANOVA followed by Dunnett's correction for multiple comparisons was employed to evaluate the differences between the datasets. The results were considered statistically significant when p < 0.05.

Visual representation of chemical space.
To generate a visual representation of the chemical space of the PeruNPDB, two visualization methods were applied to the six auto-scaled properties of pharmaceutical interest, namely MW, clogP, TPSA, clogS, HBD, and HBA. The first was principal component analysis (PCA), which reduces data dimensions by geometrically projecting them onto lower dimensions called principal components (PCs) 53 , calculated with the "PCA" KNIME node. The second was t-distributed stochastic neighbor embedding (t-SNE), a nonlinear dimension reduction technique in which Gaussian probability distributions over the high-dimensional space are constructed and used to optimize a Student's t-distribution in the low-dimensional space 54 , calculated with the t-SNE (L. Jonsson) KNIME node. Three- and two-dimensional scatter-plot representations were generated for PCA and t-SNE, respectively, with the Plotly KNIME node. Additionally, the atom-pair-based fingerprints of the NPs were obtained using the "ChemmineR" package 55 in the R programming environment (version 4.0.3) 56 , the Tanimoto similarity score was calculated for clustering the compounds, and a heatmap was generated for visualization. The same procedure was applied to the reference datasets AfroDB 27 , BIOFACQUIM 30 , and NuBBEDB 28 , retrieved from the ZINC20 database 57 .
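As an illustration of the Tanimoto similarity score mentioned above, the following minimal sketch computes the coefficient and a pairwise similarity matrix of the kind a heatmap would display. It assumes that each fingerprint is represented as a set of feature identifiers; the function names and the toy fingerprints are hypothetical, and this is not the ChemmineR implementation.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared features divided by total distinct features."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(a & b) / len(a | b)


def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto coefficients."""
    return [[tanimoto(x, y) for y in fingerprints] for x in fingerprints]


# Toy fingerprints (sets of feature identifiers) invented for illustration.
toy_fingerprints = [{1, 2, 3, 4}, {3, 4, 5}, {7, 8}]
matrix = similarity_matrix(toy_fingerprints)
for row in matrix:
    print([round(value, 2) for value in row])
```

The first two toy fingerprints share two of five distinct features, giving a coefficient of 0.4, while the third shares none with the others and scores 0.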
Global diversity: consensus diversity analysis. Since chemical diversity strongly depends on the structure representation, it is reasonable to consider multiple representations for a complete global assessment. Consensus diversity (CD) plots have been proposed as simple two-dimensional graphs that enable the comparison of the diversity of compound data sets using four sets of structural representations: molecular fingerprints, scaffolds, molecular properties, and the number of NPs 58 . The multiple-variable plot was generated with GraphPad Prism software version 9.4.0; the y-axis represents the area under the cyclic system recovery curve 59 , the x-axis represents the median of the fingerprint-based diversity computed with Molecular Access System (MACCS) keys (166 bits) and the Tanimoto coefficient 60 , the bubble color represents the molecular properties of pharmaceutical interest, and the bubble size represents the number of NPs in each database.
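The area under a cyclic system recovery (CSR) curve can be sketched as below. This is a hedged illustration, assuming the common construction in which scaffolds are ranked from most to least populated, the x-axis is the fraction of scaffolds, the y-axis is the cumulative fraction of compounds, and the area is obtained with the trapezoidal rule. The function name `csr_auc` and the scaffold labels are hypothetical, and the sketch takes scaffold assignments as given rather than deriving them from structures.

```python
from collections import Counter


def csr_auc(scaffolds):
    """Area under the cyclic system recovery curve (trapezoidal rule).

    scaffolds: one scaffold label per compound (hypothetical input).
    Values near 0.5 indicate compounds spread evenly over scaffolds;
    values approaching 1.0 indicate a few scaffolds holding most compounds.
    """
    if not scaffolds:
        raise ValueError("at least one compound is required")
    # Rank scaffolds from most to least populated.
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    n_scaffolds = len(counts)
    n_compounds = sum(counts)

    auc = 0.0
    prev_x = prev_y = 0.0
    cumulative = 0
    for rank, count in enumerate(counts, start=1):
        cumulative += count
        x = rank / n_scaffolds          # fraction of scaffolds considered
        y = cumulative / n_compounds    # fraction of compounds recovered
        auc += (x - prev_x) * (y + prev_y) / 2
        prev_x, prev_y = x, y
    return auc


# Toy scaffold labels invented for illustration.
even_auc = csr_auc(["a", "b", "c", "d"])      # every scaffold appears once
skewed_auc = csr_auc(["a", "a", "b", "c"])    # one scaffold is more frequent
print(even_auc, skewed_auc)
```

In the toy data the evenly distributed set sits on the diagonal, while the skewed set yields a larger area, which corresponds to lower scaffold diversity.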
Drug-likeness. The Osiris DataWarrior v05.02.01 software 61 was employed to calculate the drug-likeness score of the compounds from the PeruNPDB; the calculation is based on a library of 5300 substructure fragments and their associated drug-likeness scores. This library was prepared by fragmenting 3300 commercial drugs as well as 15,000 commercial non-drug-like Fluka compounds 61 . The frequency distribution of the obtained scores was computed in GraphPad Prism software version 9.4.0 for Windows (GraphPad Software, San Diego, California, USA; http://www.graphpad.com) and plotted as stacked bar plots. Furthermore, the Lipinski Rule-of-5 (Ro5) is a set of four rules (logP, MW, and H-bond donor and acceptor cut-offs) for drug-likeness and oral bioavailability derived from a subset of 2245 drugs 62 . For this, the Lipinski's Ro5 KNIME node was employed to count the number of violations of the rule for each compound in the PeruNPDB, and the results were plotted as pie charts. The US Food and Drug Administration (FDA)-approved drugs dataset 57 was employed as a reference, and the same procedures were applied to its compounds. The chemical space representation was also analyzed, following the same procedures described earlier.
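Counting Ro5 violations from precomputed descriptors can be illustrated with the short sketch below. It uses the commonly cited cut-offs (MW at most 500, logP at most 5, at most 5 H-bond donors, and at most 10 H-bond acceptors); the function name and the descriptor values are hypothetical, and the KNIME node used in the study may differ in how descriptors are calculated.

```python
def ro5_violations(mw, logp, hbd, hba):
    """Number of Lipinski Rule-of-5 cut-offs exceeded by one compound."""
    exceeded = [
        mw > 500,    # molecular weight above 500 Da
        logp > 5,    # octanol/water partition coefficient above 5
        hbd > 5,     # more than 5 hydrogen bond donors
        hba > 10,    # more than 10 hydrogen bond acceptors
    ]
    return sum(exceeded)


# Descriptor values invented for illustration, not taken from the PeruNPDB.
small_compound = ro5_violations(mw=180.2, logp=1.2, hbd=1, hba=4)
large_compound = ro5_violations(mw=853.9, logp=3.0, hbd=4, hba=14)
print(small_compound, large_compound)
```

Here the first hypothetical compound satisfies all four cut-offs, and the second exceeds the MW and H-bond acceptor limits, giving two violations.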

Results
PeruNPDB assembly. In the present study, the PeruNPDB was assembled and then characterized chemoinformatically in terms of molecular diversity and coverage of the chemical space. To select the studies from which the NPs would be retrieved, a search with the MeSH terms "Peru" AND "Natural Products" was performed in the PubMed database, followed by the construction of a network map of the co-occurrence of MeSH terms. The workflow proposed in Fig. 1 was followed. The search resulted in 399 papers published between 1950 and 2021; setting five as the minimum number of occurrences of a keyword, a map with the 194 keywords that reach the threshold was constructed (Fig. 2A). The analysis of the map shows that six main clusters were formed, and terms associated with NPs, such as "Plant Extracts", "Plants, medicinal", "Phytotherapy", "Ethnopharmacology", "Ethnobotany", "Plants stems", "Plants bark", and "Seeds", were observed in the first cluster (red color). Also, terms such as "Peru", "Humans", "Animals", and "Male" were recurrent. Using the established eligibility criteria, 47 articles published between 2000 and 2021 were selected, in which terms such as "Flavonoids", "Sesquiterpenes", and "Anthocyanins" were recurrent (Fig. 2B). Also, the bibliographic data extracted from the selected articles were analyzed: the "Journal of Agricultural and Food Chemistry", the "Journal of Ethnopharmacology", "Phytochemistry", and "Planta Medica" were the main peer-reviewed journals in which the studies describing compounds extracted from Peruvian NPs were published (Fig. 2C). Furthermore, while retrieving the SMILES of the compounds from PubChem, DrugBank, and ChEMBL, it was observed that 242 structures were found in the repositories and that 38 needed to be generated with the ChemDraw tool. Ninety-five percent of the compounds were retrieved from plant sources and five percent from animal sources (Fig. 3A).
The genera from which most of the compounds were extracted were Uncaria and Lepidium, with 11 and 10 percent, respectively (Fig. 3B). When the structures of the compounds were analyzed with a classification system for small molecule structures, 76 classes of NPs were found among the 280 NPs of the PeruNPDB, of which anthocyanidins (N=25), aporphine alkaloids (N=11), cinnamic acids and derivatives (N=17), germacrane sesquiterpenoids (N=13), stigmastane steroids (N=10), and unsaturated fatty acids (N=22) were the most frequently predicted classes (Fig. 4). [...] and plotted into box plots, which include the distribution of the same properties of the three reference datasets, retrieved from the ZINC20 database (Fig. 5). To compare the results of the datasets, the coefficient of variation (CV) was calculated, which represents the ratio of the standard deviation to the mean and is considered a useful tool to statistically compare the degree of variation from one dataset to another 54 . Except for HBA, in which NuBBEDB obtained the highest CV (123.2%), the PeruNPDB showed the highest CV in MW, clogP, TPSA, clogS, and HBD, with 46.58%, 84.49%, 112.8%, 50.08%, and 83.84%, respectively. Still, the results for TPSA, clogP, clogS, and HBD showed highly significant statistical differences compared to AfroDB, BIOFACQUIM, and NuBBEDB, while the HBA results showed no statistically significant difference compared to AfroDB (Fig. 5).
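The coefficient of variation described above, the ratio of the standard deviation to the mean expressed as a percentage, can be sketched as follows. The use of the sample standard deviation is an assumption (the text does not state which estimator was used), and the function name and input values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    average = mean(values)
    if average == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / average * 100


# Property values invented for illustration, not taken from any dataset.
toy_values = [10.0, 12.0, 14.0]
cv_percent = coefficient_of_variation(toy_values)
print(round(cv_percent, 2))
```

Because the CV divides by the mean, it becomes unstable for properties whose mean lies close to zero, which is worth keeping in mind when comparing properties that can take negative values.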
Visualization of the chemical space. The chemical space of PeruNPDB was visualized using PCA and t-SNE. The visual analysis of the 3D-PCA shows that the molecules in PeruNPDB roughly share the chemical space with NuBBEDB, whereas in some regions the molecules of PeruNPDB are predominant (Fig. 6A). The explained variance percentages of PC1, PC2, and PC3 were 50.24, 39.94, and 6.72, respectively. According to the 2D-t-SNE visual analysis, PeruNPDB, BIOFACQUIM, and NuBBEDB compounds overlap in most of the chemical space represented (Fig. 6B).

Diversity analysis.
The heatmap generated using the Tanimoto score matrix and the atom-pair-based fingerprints shows that there is similarity between the structures of the compounds of the PeruNPDB, AfroDB, BIOFACQUIM, and NuBBEDB (Fig. 6C). Also, a consensus diversity plot was used to evaluate the diversity of the PeruNPDB dataset based on molecular fingerprints, scaffolds, and physicochemical properties. The Euclidean distance of the scaled properties was used to compute the property-based diversity of the PeruNPDB, AfroDB, BIOFACQUIM, and NuBBEDB databases. In the CD plot, the values are represented by data points on a continuous color scale: darker colors signify less diversity, whereas brighter colors signify more diversity. Finally, different point sizes illustrate the size of the databases, with smaller data points indicating databases with fewer molecules. The results showed that the diversity of the compounds in the PeruNPDB was the largest, since the database was located in the area of the plot where the highest scaffold and fingerprint diversity is found (Fig. 7), which is consistent with the results shown in the box plots (Fig. 5).
Drug-likeness. Drug-likeness assesses qualitatively the chance for a molecule to become an oral drug with respect to bioavailability and is established from structural or physicochemical inspections of development compounds advanced enough to be considered oral drug candidates 63 . To assess the "drug-like" profile of the compounds from the PeruNPDB, two approaches were followed. Firstly, the frequency distribution of the drug-likeness score was analyzed, and the results showed that, despite the difference in the number of compounds in the two datasets, a similar distribution among the compounds is observed (Fig. 8A).
In the second approach, the number of violations of Lipinski's Ro5 was analyzed, and the results showed that compounds with at least one violation represent 85.82% and 76.35% of the FDA and PeruNPDB datasets, respectively (Fig. 8B). Also, the visual representations of the chemical space by PCA (Fig. 8C) and t-SNE (Fig. 8D) indicate that some of the NPs are distributed in the same space as the already approved drugs. The explained variance percentages of PC1, PC2, and PC3 were 52.38, 37.64, and 5.54, respectively. The findings imply that, because the compounds in PeruNPDB have chemical structures similar to those of approved medications, they can be used in virtual screening to find possible lead compounds or starting points for further optimization.

Discussion
Peru has exceptionally high biodiversity, with numerous endemic species of mammals, reptiles, amphibians, flowering plants, and ferns, which is why it has been described as a "megadiverse" country 64,65 ; however, a worldwide hotspot analysis of potential conflict between food security and biodiversity conservation points out Peru as a region that is especially at risk of biodiversity loss due to agricultural expansion 66 . Thus, the conservation of biodiversity can be considered important, since historically NPs have played a key role in drug discovery, especially for illnesses such as cancer and cardiovascular and infectious diseases 67 , while the growing interest in NPs and their applications is evidenced by the growing number of published databases of NPs and collections of structures from various organisms, geographical locations, targeted diseases, and traditional applications 68 . Currently, several NPs or NPs-derived molecules are employed in the treatment of distinct diseases; some examples are the antibiotic penicillin, originally obtained from the fungus Penicillium spp. 69 ; the analgesic aspirin, which is the most used drug in the world, derived from salicin extracted from the bark of the willow tree Salix alba 70 ; and the immunosuppressant tacrolimus, employed in the prevention of organ rejection after transplants and obtained from the bacterium Streptomyces tsukubaensis 71 .
Besides, NPs and their derivatives have been considered promising options to improve treatment efficiency in cancer patients and decrease adverse reactions 72 ; vinca alkaloids 73 , taxane diterpenoids 74 , camptothecin derivatives 75 , and epipodophyllotoxin 76 are NPs-derived anticancer compounds clinically used as chemotherapeutics. The importance of biodiversity conservation is exemplified by the tree Taxus brevifolia, from which the chemotherapeutic drug paclitaxel was originally extracted and which was put on the list of endangered species 77,78 . According to the data, fewer compounds are included in the PeruNPDB than in AfroDB, BIOFACQUIM, and NuBBEDB, but their chemical diversity is higher. Of the 280 compounds characterized, 95% came from plant sources and 5% from animal sources. In the BIOFACQUIM and NuBBE databases, in addition to plant sources, compounds derived from fungi, propolis, bacteria, and marine organisms are also described. This partially explains the difference in the TPSA results of the PeruNPDB, since it has been reported that natural products from the animal kingdom have the highest TPSA due to the number of hydrogen bond donors and acceptors 79 . Furthermore, the Peruvian marine biodiversity hotspot located on the northern coast has been predicted to hold 501 species, 270 genera, and 193 families 80 , and marine natural products have shown an interesting array of diverse and novel chemical structures with potent biological activities 81 , which include cephalosporin C, an antibiotic derived from the marine fungus Cephalosporium 82 ; eribulin, an anticancer drug derived from halichondrin B from the Japanese marine sponge Halichondria okadai 83 ; and the antiviral nucleoside Ara-A, isolated from the sponge Tethya crypta 84 . Also, Peru is considered a country with a very broad microbial diversity, which, however, remains little studied and exploited 85,86 .
Fungi, which are eukaryotic microorganisms, produce a tremendous number of NPs with diverse chemical structures and biological activities 87 , such as lovastatin, the first statin approved by the FDA as a medication for hypercholesterolemia, most frequently produced by Aspergillus terreus 88 , and cyclosporine A, a potent immunosuppressant that was initially used to prevent organ rejection, isolated from the fungal species Tolypocladium inflatum Gams 89 . Although no current drug has been developed from propolis, it is considered to have a very rich and complex chemical composition, with about 300 different chemical components isolated from it, and its composition fluctuates according to parameters such as plant source, harvesting season, geography, type of bee flora, climate changes, and honeybee species 90,91 ; a highlight is artepillin C, extracted from Brazilian green propolis, which showed in vitro 92 and in vivo 93 anti-inflammatory potential. These observations emphasize the urgency to promote and enhance the study of Peruvian NPs quantitatively and qualitatively. Compounds from Peruvian medicinal plants have been evaluated for their antidiabetic 94 , anticancer 95 , antiviral 96 , antibiotic 97 , and antiparasitic activities 98 ; however, most of the studies in the literature were performed in vitro on plant extracts, and little information about the potential of single compounds for these activities is described, while these promising results can be explained by synergistic interactions or multi-factorial effects between the compounds present in the plant extracts studied 99 .
Pharmacodynamic synergy involves multiple substances acting on various receptor targets to enhance the overall therapeutic effect, whereas pharmacokinetic synergy involves substances with little to no activity helping the main active principle to reach the target by improving bioavailability or by reducing metabolism and excretion; assays of this type can therefore hide the true potential of the activity of single molecules among the different constituents of plant extracts 100 . Thus, the concerted effort of experimental NPs research with CADD is

Conclusion
Here we present the first version of PeruNPDB, a compound database of NPs from Peru that includes 280 compounds from plant and animal sources. PeruNPDB was constructed, curated, and maintained by the Computational Biology and Chemistry Research Group from the Universidad Catolica de Santa Maria, and it is freely accessible through the website https://perunpdb.com.pe/. The PeruNPDB was envisioned as a tool for virtual screening, identifying promising compounds, serving as a springboard for further biotechnological products, and providing suggestions for conservation policies. The chemoinformatic characterization and analysis of the coverage and diversity of PeruNPDB in chemical space suggest broad coverage, overlapping with regions in the drug-like chemical space. For each natural product, the database contains an identification code (ID), the chemical name, the bibliographic reference (name of the journal, year of publication, and DOI number), the kingdom, genus, and species of the source, the SMILES notation, and the classification of the natural product. In the future, we want to launch version 2 of the PeruNPDB with new computed molecular descriptors, NP stereochemical data, and the possibility to download several structures at once. The web-based user interface will also be improved and maintained, and new NPs from various taxonomic ranks that are not included in the current edition will be added. Additionally, as we increase the quantity of NPs, we anticipate comparing the PeruNPDB with larger, more varied free datasets that are available in the literature. The complete PeruNPDB dataset is available for research purposes upon request; requests should be directed to, and will be fulfilled by, the lead contact Miguel Angel Chavez Fumagalli (mchavezf@ucsm.edu.pe).

Data availability
The datasets generated and/or analyzed during the current study are available in the PeruNPDB repository, https://perunpdb.com.pe/.